Skip to content

optimize u128/i128 popcounts further #354

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Oct 19, 2023

Conversation

sadlerap
Copy link
Contributor

Don't fall back on breaking apart the popcount operation if 128-bit integers are natively supported.

@antoyo
Copy link
Contributor

antoyo commented Oct 18, 2023

Thanks!

Do you think it would be worth trying to make it work even for non-native integers by using the operations from the int module as I suggested earlier?

@sadlerap
Copy link
Contributor Author

I'll give it a try - if it ends up working, I'll replace what I've posted here with whatever works. If not, I'll let you know what fails.

@sadlerap
Copy link
Contributor Author

I think I got a bug in libgccjit.

When I compile with this patch, I get the following error:

Details

rustc_codegen_gcc is build for rustc 1.75.0-nightly (97c81e1b5 2023-10-07) but the default rustc version is rustc 1.75.0-nightly (49691b1f7 2023-10-16).
Using rustc 1.75.0-nightly (97c81e1b5 2023-10-07).
   Compiling minmax_test v0.1.0 (/home/andy/repos/minmax_test)
libgccjit.so: warning: command-line option ‘-x86-asm-syntax=intel’ is valid for Modula-2 but not for
error: rustc interrupted by SIGSEGV, printing backtrace

/home/andy/.rustup/toolchains/nightly-2023-10-08-x86_64-unknown-linux-gnu/lib/librustc_driver-3558b9258144fc36.so(+0x2aff526)[0x7f00466ff526]
/usr/lib/libc.so.6(+0x3e710)[0x7f00438c6710]
/home/andy/repos/gcc/build/gcc/libgccjit.so.0(+0x592255)[0x7f0038392255]
/home/andy/repos/gcc/build/gcc/libgccjit.so.0(+0x592f78)[0x7f0038392f78]
/home/andy/repos/gcc/build/gcc/libgccjit.so.0(+0x597d84)[0x7f0038397d84]

### cycle encountered after 5 frames with period 6
/home/andy/repos/gcc/build/gcc/libgccjit.so.0(+0x592780)[0x7f0038392780]
/home/andy/repos/gcc/build/gcc/libgccjit.so.0(+0x59e41d)[0x7f003839e41d]
/home/andy/repos/gcc/build/gcc/libgccjit.so.0(+0x5930a4)[0x7f00383930a4]
/home/andy/repos/gcc/build/gcc/libgccjit.so.0(+0x595f8d)[0x7f0038395f8d]
/home/andy/repos/gcc/build/gcc/libgccjit.so.0(+0x595fc7)[0x7f0038395fc7]
/home/andy/repos/gcc/build/gcc/libgccjit.so.0(+0x597e87)[0x7f0038397e87]
### recursed 41 times

/home/andy/repos/gcc/build/gcc/libgccjit.so.0(+0x592780)[0x7f0038392780]
/home/andy/repos/gcc/build/gcc/libgccjit.so.0(+0x59e41d)[0x7f003839e41d]
/home/andy/repos/gcc/build/gcc/libgccjit.so.0(+0x5930a4)[0x7f00383930a4]
/home/andy/repos/gcc/build/gcc/libgccjit.so.0(+0x595f8d)[0x7f0038395f8d]
/home/andy/repos/gcc/build/gcc/libgccjit.so.0(+0x595fc7)[0x7f0038395fc7]

note: rustc unexpectedly overflowed its stack! this is a bug
note: maximum backtrace depth reached, frames may have been lost
note: we would appreciate a report at https://github.com/rust-lang/rust
note: backtrace dumped due to SIGSEGV! resuming signal
error: could not compile `minmax_test` (lib)

Caused by:
  process didn't exit successfully: `/home/andy/.rustup/toolchains/nightly-2023-10-08-x86_64-unknown-linux-gnu/bin/rustc --crate-name minmax_test --edition=2021 src/lib.rs --error-format=json --json=diagnostic-rendered-ansi,artifacts,future-incompat --diagnostic-width=141 --crate-type lib --emit=dep-info,metadata,link -C opt-level=3 -C embed-bitcode=no --emit asm -Ccodegen-units=1 -C llvm-args=-x86-asm-syntax=intel -Cdebuginfo=2 -C metadata=19b01f69af993630 -C extra-filename=-19b01f69af993630 --out-dir /home/andy/repos/minmax_test/target/release/deps -L dependency=/home/andy/repos/minmax_test/target/release/deps -Csymbol-mangling-version=v0 -Cdebuginfo=2 -Clto=off -Zcodegen-backend=/home/andy/repos/rustc_codegen_gcc/target/release/librustc_codegen_gcc.so --sysroot /home/andy/repos/rustc_codegen_gcc/build_sysroot/sysroot` (signal: 11, SIGSEGV: invalid memory reference)

Cargo failed with exit status: 101

@antoyo
Copy link
Contributor

antoyo commented Oct 18, 2023

Indeed. Good find!
You can see in the CI that it's because you use an array for a binary operation.
I presume it's in the call to add_assignment_op().
libgccjit doesn't check that the values in this call are numeric.
I'll fix this!
Thanks!

@sadlerap
Copy link
Contributor Author

hmm, I wonder if replacing the assignment ops with just plain assignments would work. I'll give it a try.

@antoyo
Copy link
Contributor

antoyo commented Oct 19, 2023

I added the type check in this commit.

Yeah, that might work indeed.

@sadlerap
Copy link
Contributor Author

Sadly, while subbing out assignment ops with pure assignments works around the bug, it also results in poor codegen when 128-bit integers aren't supported and the fallback is skipped.

Details

Before:

popcount_128:

        xor eax, eax
        popcnt rax, rsi
        test rsi, rsi
        cmove rax, rsi

        xor ecx, ecx
        popcnt rcx, rdi
        mov rdx, rax
        add rax, rcx

        test rdi, rdi
        cmove rax, rdx

        ret

After:

popcount_128:

        push r14

        mov rax, rsi
        push r13
        push r12
        push rbp
        push rbx
        sub rsp, 32

        or rax, rdi
        mov QWORD PTR 16[rsp], rdi

        mov QWORD PTR 24[rsp], rsi
        vmovdqa xmm3, XMMWORD PTR 16[rsp]
        vmovdqa XMMWORD PTR [rsp], xmm3
        je .L67
        mov rbp, rsi

        mov rbx, rdi

        xor r12d, r12d

.L66:
        mov r14, QWORD PTR [rsp]
        mov r13, QWORD PTR 8[rsp]
        xor ecx, ecx
        mov edx, 1
        add r12d, 1
        mov rdi, r14
        mov rsi, r13
        call __rust_u128_sub@PLT

        xor ecx, ecx
        mov edx, 1
        mov rdi, r14
        mov rsi, r13
        and rbx, rax
        mov QWORD PTR 16[rsp], rax
        call __rust_u128_sub@PLT

        vmovq xmm0, QWORD PTR 16[rsp]
        mov rax, rbx
        and rbp, rdx
        vpinsrq xmm0, xmm0, rdx, 1
        vpand xmm1, xmm0, XMMWORD PTR [rsp]
        or rax, rbp
        vmovdqa XMMWORD PTR [rsp], xmm1
        jne .L66
        mov eax, r12d
.L63:
        add rsp, 32
        pop rbx

        pop rbp

        pop r12

        pop r13
        pop r14

        ret
.L67:

        xor eax, eax
        jmp .L63

However, when 128-bit integers are supported, we do see the improvement as expected.

Details

Before:

popcount_128:

        xor eax, eax
        popcnt rax, rsi
        test rsi, rsi
        cmove rax, rsi

        xor ecx, ecx
        popcnt rcx, rdi
        mov rdx, rax
        add rax, rcx

        test rdi, rdi
        cmove rax, rdx

        ret

After:

popcount_128:

        mov rdx, rdi
        or rdx, rsi

        je .L65

        xor ecx, ecx

        popcnt rsi, rsi

        popcnt rcx, rdi
        lea eax, [rsi+rcx]
        ret
.L65:

        xor eax, eax
        ret

I think if we gate the fallback behind native support for 128-bit integers, we'll get the best of both worlds - good-ish codegen when 128-bit integers are supported, and a roughly as-good fallback when they aren't.

@antoyo
Copy link
Contributor

antoyo commented Oct 19, 2023

Do you know (or did you do some benchmarks to check) if having one jump is indeed faster than the 2 conditional moves?

@sadlerap
Copy link
Contributor Author

I wrote some microbenchmarks with criterion, but they seemed to get stuck on the anaylsis step (let it run for about 15 minutes and it still wasn't done, whereas the same code compiled with llvm completed in a few seconds). I'll try some other options, see if I get more luck there.

@sadlerap
Copy link
Contributor Author

sadlerap commented Oct 19, 2023

Alright, I think I have some solid numbers (at least for my machine).

My benchmarks run popcount 1000 times on the same number and add the results together. This is done to inflate iteration times a bit - with an operation as small as this, it's hard to get good numbers for how long it takes to complete.

Before:

test tests::bench_popcount_u128_0       ... bench:         668 ns/iter (+/- 57)
test tests::bench_popcount_u128_1       ... bench:         680 ns/iter (+/- 9)
test tests::bench_popcount_u128_10      ... bench:         683 ns/iter (+/- 12)
test tests::bench_popcount_u128_minus_1 ... bench:         705 ns/iter (+/- 16)

After:

test tests::bench_popcount_u128_0       ... bench:         580 ns/iter (+/- 5)
test tests::bench_popcount_u128_1       ... bench:         592 ns/iter (+/- 91)
test tests::bench_popcount_u128_10      ... bench:         595 ns/iter (+/- 186)
test tests::bench_popcount_u128_minus_1 ... bench:         593 ns/iter (+/- 12)

Source:

#![feature(test)]

pub fn popcount_128(a: u128) -> u32 {
    a.count_ones()
}

#[cfg(test)]
pub mod tests {
    use super::*;
    use test::{Bencher, black_box};

    #[bench]
    fn bench_popcount_u128_0(b: &mut Bencher) {
        let x = black_box(0u128);
        b.iter(|| (0..1000).fold(0, |old, _new| old + popcount_128(black_box(x))))
    }

    #[bench]
    fn bench_popcount_u128_1(b: &mut Bencher) {
        let x = black_box(1u128);
        b.iter(|| (0..1000).fold(0, |old, _new| old + popcount_128(black_box(x))))
    }

    #[bench]
    fn bench_popcount_u128_10(b: &mut Bencher) {
        let x = black_box(10u128);
        b.iter(|| (0..1000).fold(0, |old, _new| old + popcount_128(black_box(x))))
    }

    #[bench]
    fn bench_popcount_u128_minus_1(b: &mut Bencher) {
        let x = black_box(-1i128 as u128);
        b.iter(|| (0..1000).fold(0, |old, _new| old + popcount_128(black_box(x))))
    }

}

Not sure if other processor types will see similar results - I'm on a R7 5800X, but you might see different results on Intel processors/older AMD processors.

I think a lot of the branch/conditional moves will go away once #351 gets resolved, so this might not even matter in the long term.

Don't fall back on breaking apart the popcount operation if 128-bit
integers are natively supported.

Signed-off-by: Andy Sadler <[email protected]>
@antoyo antoyo merged commit 2096606 into rust-lang:master Oct 19, 2023
@antoyo
Copy link
Contributor

antoyo commented Oct 19, 2023

Thanks a lot for your work!

@sadlerap sadlerap deleted the popcount-128-bits branch October 19, 2023 11:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants